MSDA - Bootcamp 2025 Summer
KT Wong
Faculty of Social Sciences, HKU
2025-07-30
The readr package is part of the tidyverse and is used to read data into R
The dataset used here is a subset of the Add Health dataset
After the dataset is loaded in R, it is important to explore the data to understand its structure and content
spc_tbl_ [3,000 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ id : num [1:3000] 1 2 3 4 5 6 7 8 9 10 ...
$ age : num [1:3000] 18 22 18 26 27 21 19 27 18 25 ...
$ gender : chr [1:3000] "female" "male" "female" "female" ...
$ income : num [1:3000] 19252 11617 16189 18194 24484 ...
$ logincome : num [1:3000] 9.87 9.36 9.69 9.81 10.11 ...
$ debt : chr [1:3000] "yesdebt" "nodebt" "yesdebt" "yesdebt" ...
$ love : num [1:3000] 1 10 10 2 5 10 3 4 1 6 ...
$ nocheating : num [1:3000] 7 10 3 1 10 4 10 10 10 3 ...
$ money : num [1:3000] 9 3 5 3 9 9 9 7 3 8 ...
$ paypercent : num [1:3000] 46 56 42 82 93 42 89 55 43 53 ...
$ logpaypercent: num [1:3000] 3.83 4.03 3.74 4.41 4.53 ...
- attr(*, "spec")=
.. cols(
.. id = col_double(),
.. age = col_double(),
.. gender = col_character(),
.. income = col_double(),
.. logincome = col_double(),
.. debt = col_character(),
.. love = col_double(),
.. nocheating = col_double(),
.. money = col_double(),
.. paypercent = col_double(),
.. logpaypercent = col_double()
.. )
- attr(*, "problems")=<externalptr>
# A tibble: 5 × 11
id age gender income logincome debt love nocheating money paypercent
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 18 female 19252. 9.87 yesdebt 1 7 9 46
2 2 22 male 11617. 9.36 nodebt 10 10 3 56
3 3 18 female 16189. 9.69 yesdebt 10 3 5 42
4 4 26 female 18194. 9.81 yesdebt 2 1 3 82
5 5 27 female 24484. 10.1 yesdebt 5 10 9 93
# ℹ 1 more variable: logpaypercent <dbl>
18 19 20 21 22 23 24 25 26 27
306 299 300 315 303 265 301 278 296 337
# A tibble: 3 × 1
id
<dbl>
1 1
2 2
3 3
# A tibble: 2,999 × 11
id age gender income logincome debt love nocheating money paypercent
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 2 22 male 11617. 9.36 nodebt 10 10 3 56
2 3 18 female 16189. 9.69 yesdebt 10 3 5 42
3 4 26 female 18194. 9.81 yesdebt 2 1 3 82
4 5 27 female 24484. 10.1 yesdebt 5 10 9 93
5 6 21 female 22353. 10.0 nodebt 10 4 9 42
6 7 19 male 11842. 9.38 yesdebt 3 10 9 89
7 8 27 female 19874. 9.90 nodebt 4 10 7 55
8 9 18 male 27422. 10.2 nodebt 1 10 3 43
9 10 25 female 9968. 9.21 yesdebt 6 3 8 53
10 11 24 female 26354. 10.2 nodebt 10 10 10 52
# ℹ 2,989 more rows
# ℹ 1 more variable: logpaypercent <dbl>
[1] 15127.34
[1] 22.51133
[1] "yesdebt" "nodebt"
id age gender income
Min. : 1.0 Min. :18.00 Length:3000 Min. : 1008
1st Qu.: 750.8 1st Qu.:20.00 Class :character 1st Qu.: 9372
Median :1500.5 Median :22.00 Mode :character Median :15127
Mean :1500.5 Mean :22.51 Mean :15231
3rd Qu.:2250.2 3rd Qu.:25.00 3rd Qu.:20518
Max. :3000.0 Max. :27.00 Max. :41700
logincome debt love nocheating
Min. : 3.292 Length:3000 Min. : 1.000 Min. : 1.000
1st Qu.: 9.222 Class :character 1st Qu.: 5.000 1st Qu.: 5.000
Median : 9.650 Mode :character Median :10.000 Median :10.000
Mean : 9.482 Mean : 7.707 Mean : 7.694
3rd Qu.: 9.939 3rd Qu.:10.000 3rd Qu.:10.000
Max. :10.638 Max. :10.000 Max. :10.000
NA's :97
money paypercent logpaypercent
Min. : 1.000 Min. : 1.00 Min. :0.000
1st Qu.: 3.000 1st Qu.: 25.00 1st Qu.:3.219
Median : 6.000 Median : 51.00 Median :3.932
Mean : 5.569 Mean : 50.45 Mean :3.629
3rd Qu.: 8.000 3rd Qu.: 76.00 3rd Qu.:4.331
Max. :10.000 Max. :100.00 Max. :4.605
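The exploration output above comes from a handful of base R and readr calls. The following is a minimal sketch of that workflow, not the original course code: the CSV file is a hypothetical stand-in written to a temporary file, and the object name `demo` is an assumption. Only its first three rows mirror values shown above.

```r
library(readr)

# Hypothetical stand-in for the course file: a tiny CSV written to a temp file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender,income",
             "1,18,female,19252",
             "2,22,male,11617",
             "3,18,female,16189"), csv_path)

# read_csv() guesses the column types and returns a tibble
demo <- read_csv(csv_path, show_col_types = FALSE)

str(demo)            # structure: column names, types, first values
head(demo, 2)        # first rows
table(demo$age)      # frequency count for each age
demo[1:2, 1]         # rows 1-2 of the first column
demo[-1, ]           # everything except the first row
median(demo$income)  # a single summary statistic
unique(demo$gender)  # distinct values of a character column
summary(demo)        # per-column summaries
```

With the real data, the same calls would be applied to the tibble returned by `read_csv()` on the course file.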
The dplyr package is part of the tidyverse and is used for data manipulation
dplyr functions include:
very important function: pipe operator %>% from the magrittr package
basic structure of the dplyr functions
dplyr - select
helper functions for select(): starts_with(), ends_with(), contains(), matches()
# A tibble: 5 × 2
paypercent logpaypercent
<dbl> <dbl>
1 46 3.83
2 56 4.03
3 42 3.74
4 82 4.41
5 93 4.53
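A short sketch of how the select() helpers and the pipe fit together. The tibble `toy` is a made-up illustration that borrows column names from the output above; it is not the course data.

```r
library(dplyr)

# Toy data: column names follow the course data, the object itself is hypothetical
toy <- tibble(id = 1:3,
              paypercent = c(46, 56, 42),
              logpaypercent = log(c(46, 56, 42)),
              income = c(19252, 11617, 16189))

# x %>% f(y) is another way of writing f(x, y)
pay_cols  <- toy %>% select(contains("paypercent"))    # both paypercent columns
log_cols  <- toy %>% select(starts_with("log"))        # logpaypercent only
come_cols <- toy %>% select(ends_with("come"))         # income only
re_cols   <- toy %>% select(matches("^(id|income)$"))  # regular-expression match

names(pay_cols)
```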
[1] 1096
[1] 3000
sort data based on one or more columns
task: find the two respondents who think money is extremely important for a relationship (10 on money) but who pay for the smallest percentage of dates (paypercent)
# A tibble: 2 × 11
id age gender income logincome debt love nocheating money paypercent
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 811 22 male 34161. 10.4 yesdebt 10 9 10 2
2 2086 20 male 4816. 8.48 yesdebt 10 10 10 2
# ℹ 1 more variable: logpaypercent <dbl>
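One way to approach this kind of task is to filter first and then sort. The sketch below uses a made-up six-row tibble (`toy` and its values are hypothetical), so the result differs from the course output above.

```r
library(dplyr)

# Hypothetical data with the two columns the task needs
toy <- tibble(id = 1:6,
              money = c(10, 10, 3, 10, 7, 10),
              paypercent = c(40, 2, 1, 15, 5, 60))

lowest_payers <- toy %>%
  filter(money == 10) %>%   # keep only respondents who rate money 10
  arrange(paypercent) %>%   # ascending by default; desc() reverses the order
  head(2)                   # first two rows after sorting

lowest_payers
```

Respondent 3 has the lowest paypercent overall but is dropped by the filter, which is why the order of the two steps matters.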
create new variables added to the dataset
task: add a variable with the average rating for nocheating, money, and love’s importance for a relationship (sum divided by 3) and another variable that logs that rating
# A tibble: 5 × 13
id age gender income logincome debt love nocheating money paypercent
<dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 18 female 19252. 9.87 yesdebt 1 7 9 46
2 2 22 male 11617. 9.36 nodebt 10 10 3 56
3 3 18 female 16189. 9.69 yesdebt 10 3 5 42
4 4 26 female 18194. 9.81 yesdebt 2 1 3 82
5 5 27 female 24484. 10.1 yesdebt 5 10 9 93
# ℹ 3 more variables: logpaypercent <dbl>, rateavg <dbl>, rateavglog <dbl>
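A minimal sketch of the mutate() step, using the love, nocheating, and money values of the first three rows shown above. The object name `toy` is an assumption.

```r
library(dplyr)

# First three rows of the rating columns, as printed above
toy <- tibble(love = c(1, 10, 10),
              nocheating = c(7, 10, 3),
              money = c(9, 3, 5))

toy <- toy %>%
  mutate(rateavg = (love + nocheating + money) / 3,
         rateavglog = log(rateavg))   # a later expression can use an earlier one

toy
```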
group data by one or more variables and then summarize the data according to the groups
task: find the average “not cheating importance” for each gender
# A tibble: 2 × 2
gender mean_nocheating
<chr> <dbl>
1 female 7.79
2 male 7.60
# A tibble: 4 × 4
# Groups: gender [2]
gender debt percentage n_distinct_love
<chr> <chr> <dbl> <int>
1 female nodebt 0.256 10
2 female yesdebt 0.245 10
3 male nodebt 0.248 10
4 male yesdebt 0.251 10
# A tibble: 4 × 5
# Groups: gender [2]
gender debt mean_love mean_nocheating mean_money
<chr> <chr> <dbl> <dbl> <dbl>
1 male yesdebt 7.76 7.72 5.66
2 female yesdebt 7.57 7.75 5.59
3 female nodebt 7.82 7.83 5.54
4 male nodebt 7.68 7.47 5.49
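The grouped summaries above follow a common group_by() then summarize() pattern. The sketch below is an illustration on made-up data (`toy` is hypothetical); `na.rm = TRUE` is included on the assumption that a rating column may contain missing values.

```r
library(dplyr)

# Hypothetical data with the grouping and rating columns
toy <- tibble(gender = c("female", "male", "female", "female", "male", "male"),
              debt = c("yesdebt", "nodebt", "yesdebt", "nodebt", "yesdebt", "nodebt"),
              nocheating = c(7, 10, 3, 1, 10, 4))

# One row per gender
by_gender <- toy %>%
  group_by(gender) %>%
  summarize(mean_nocheating = mean(nocheating, na.rm = TRUE))

# One row per gender-debt combination, with each group's share of all rows
by_gender_debt <- toy %>%
  group_by(gender, debt) %>%
  summarize(n = n(), .groups = "drop") %>%
  mutate(percentage = n / sum(n))

by_gender
by_gender_debt
```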
mutate() function to change the data type of a variable
# A tibble: 3 × 13
id age gender income logincome debt love nocheating money paypercent
<dbl> <chr> <chr> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 1 18 female 19252. 9.87 yesdebt 1 7 9 46
2 2 22 male 11617. 9.36 nodebt 10 10 3 56
3 3 18 female 16189. 9.69 yesdebt 10 3 5 42
# ℹ 3 more variables: logpaypercent <dbl>, rateavg <dbl>, rateavglog <dbl>
[1] 18 21 23 25 27 30
[1] "numeric"
[1] "male" "female" "other" "female" "female" "male"
[1] "character"
[1] "28" "28" "TRUE"
[1] "character"
[1] "18" "21" "23" "25" "27" "30"
[1] NA NA NA NA NA NA
[1] male female other female female male
Levels: male female other
[1] "factor"
[1] 1 2 3 2 2 1
[1] 1 1 1 1 1
[1] 1997 2002 2007 2012 2017 2022
[1] "age_22" "age_23" "age_24" "age_25" "age_26" "age_27" "age_28" "age_29"
[9] "age_30"
# A tibble: 6 × 2
age gender
<dbl> <chr>
1 18 male
2 21 female
3 23 other
4 25 female
5 27 female
6 30 male
Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 1 2 ...
convert the variable gender in addh to a factor variable
what happens if you try to convert the variable to character by using as.character() after the factor conversion
what happens if you try to convert the variable to numeric by using as.numeric() after the factor conversion
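As a reference point for these questions, the sketch below repeats the small gender vector printed earlier and adds a hypothetical factor of age labels (`age_f`). It illustrates the general behaviour on toy vectors rather than on addh.

```r
# Same values and level order as the small example vector shown earlier
gender <- c("male", "female", "other", "female", "female", "male")
gender_f <- factor(gender, levels = c("male", "female", "other"))

as.character(gender_f)  # returns the labels
as.numeric(gender_f)    # returns the integer codes: 1 2 3 2 2 1

# Hypothetical factor whose labels look like numbers
age_f <- factor(c("18", "21", "30"))
as.numeric(age_f)                # codes 1 2 3, not the ages
as.numeric(as.character(age_f))  # 18 21 30: convert to character first
```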
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
[1] 3 5
A B C D E
X 1 4 7 10 13
Y 2 5 8 11 14
Z 3 6 9 12 15
[1] 8
A B C D E
2 5 8 11 14
[,1] [,2]
[1,] 76 103
[2,] 100 136
[,1] [,2] [,3]
[1,] 27 61 95
[2,] 30 68 106
[3,] 33 75 117
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
[,1] [,2] [,3]
[1,] 1.14705882 -0.2058824 -0.08823529
[2,] -0.05882353 -0.1176471 0.23529412
[3,] -0.61764706 0.2647059 -0.02941176
[1] -34
eigen() decomposition
$values
[1] 12.502029 -3.320941 0.818912
$vectors
[,1] [,2] [,3]
[1,] -0.1946720 -0.07646107 -0.8799055
[2,] -0.7265653 -0.79546586 0.1194865
[3,] -0.6589428 0.60115536 0.4598796
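The matrix output above can be approximated with base R calls such as the following. This is a sketch, not the original code: `m`, `A`, and `B` are chosen so that they reproduce the first few results printed above, while `S` is a hypothetical invertible matrix that differs from the one used for the inverse, determinant, and eigen output.

```r
m <- matrix(1:15, nrow = 3)   # filled column by column
dim(m)                        # 3 rows, 5 columns
dimnames(m) <- list(c("X", "Y", "Z"), c("A", "B", "C", "D", "E"))
m[2, 3]                       # element in row 2, column 3
m["Y", ]                      # whole row, selected by name

A <- matrix(1:6, nrow = 2)    # 2 x 3
B <- matrix(7:12, nrow = 3)   # 3 x 2
A %*% B                       # matrix product, 2 x 2
B %*% A                       # matrix product, 3 x 3
t(A)                          # transpose, 3 x 2

S <- matrix(c(2, 1, 1, 3), nrow = 2)  # hypothetical invertible matrix
solve(S)                      # inverse
det(S)                        # determinant
eigen(S)                      # eigenvalues and eigenvectors
```

Note that `*` multiplies element by element, whereas `%*%` is the matrix product.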
let us start by discussing logical operators
the main logical operators used in R are:
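A brief sketch of the comparison and logical operators most often seen in R code, applied to a hypothetical vector `x`. Each expression returns one TRUE or FALSE per element.

```r
x <- c(1, 5, 10)

x == 5             # equal to
x != 5             # not equal to
x > 1              # greater than (also <, >=, <=)
x > 1 & x < 10     # AND, element by element
x < 2 | x > 9      # OR, element by element
!(x > 1)           # NOT
x %in% c(1, 10)    # membership in a set of values
```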
# A tibble: 5 × 14
id money_over_love age gender income logincome debt love nocheating
<dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 18 female 19252. 9.87 yesdebt 1 7
2 2 0 22 male 11617. 9.36 nodebt 10 10
3 3 0 18 female 16189. 9.69 yesdebt 10 3
4 4 1 26 female 18194. 9.81 yesdebt 2 1
5 5 1 27 female 24484. 10.1 yesdebt 5 10
# ℹ 5 more variables: money <dbl>, paypercent <dbl>, logpaypercent <dbl>,
# rateavg <dbl>, rateavglog <dbl>
# A tibble: 5 × 14
id money_or_love age gender income logincome debt love nocheating money
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 money greater 18 female 19252. 9.87 yesd… 1 7 9
2 2 love greater 22 male 11617. 9.36 node… 10 10 3
3 3 love greater 18 female 16189. 9.69 yesd… 10 3 5
4 4 money greater 26 female 18194. 9.81 yesd… 2 1 3
5 5 money greater 27 female 24484. 10.1 yesd… 5 10 9
# ℹ 4 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
# rateavglog <dbl>
use case_when() if there are 3 or more conditions for creating a variable
its syntax is the following:
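A minimal sketch of the pattern, assuming if_else() for a two-way split and case_when() for three outcomes. The first five rows of `toy` repeat the love and money values printed above; the sixth row and the "equal" label are hypothetical additions that show the fallback branch.

```r
library(dplyr)

toy <- tibble(love  = c(1, 10, 10, 2, 5, 7),
              money = c(9, 3, 5, 3, 9, 7))

toy <- toy %>%
  mutate(
    # two outcomes: if_else(condition, value_if_true, value_if_false)
    money_over_love = if_else(money > love, 1, 0),
    # three or more outcomes: conditions are checked in order,
    # and the first one that is TRUE decides the value
    money_or_love = case_when(
      money > love ~ "money greater",
      money < love ~ "love greater",
      TRUE         ~ "equal"          # fallback for everything else
    )
  )

toy
```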
# A tibble: 5 × 14
id money_or_love age gender income logincome debt love nocheating money
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 money greater 18 female 19252. 9.87 yesd… 1 7 9
2 2 love greater 22 male 11617. 9.36 node… 10 10 3
3 3 love greater 18 female 16189. 9.69 yesd… 10 3 5
4 4 money greater 26 female 18194. 9.81 yesd… 2 1 3
5 5 money greater 27 female 24484. 10.1 yesd… 5 10 9
# ℹ 4 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
# rateavglog <dbl>
money_or_love in the addh dataset
# A tibble: 5 × 15
id income_level age gender income logincome debt love nocheating money
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 medium 18 female 19252. 9.87 yesde… 1 7 9
2 2 medium 22 male 11617. 9.36 nodebt 10 10 3
3 3 medium 18 female 16189. 9.69 yesde… 10 3 5
4 4 medium 26 female 18194. 9.81 yesde… 2 1 3
5 5 high 27 female 24484. 10.1 yesde… 5 10 9
# ℹ 5 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
# rateavglog <dbl>, high_income <dbl>
[1] 5.57076
[1] 0.5221372 -0.4693026 0.1243644 0.3847371 1.2013581 0.9246830
[7] -0.4400798 0.6028251 1.5828892 -0.6833786
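The plotting code that follows calls sample_means(), whose definition does not appear in this text. The sketch below is one plausible version, not the original: it assumes the second argument is the size of each sample and the third is the number of samples, and it draws with replacement. `toy_age` is simulated stand-in data.

```r
# Hypothetical helper: draw `reps` samples of size `n` from x
# and return the mean of each sample
sample_means <- function(x, n, reps) {
  replicate(reps, mean(sample(x, size = n, replace = TRUE), na.rm = TRUE))
}

set.seed(1)                                     # reproducible draws
toy_age <- sample(18:27, 3000, replace = TRUE)  # simulated ages
toy_means <- sample_means(toy_age, 500, 1000)

length(toy_means)   # one mean per sample
mean(toy_means)     # close to mean(toy_age)
```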
age_sample_means <- sample_means(addh$age, 500, 1000)
ggplot(as.data.frame(age_sample_means), aes(age_sample_means)) +
  geom_density() +
  geom_vline(xintercept = mean(age_sample_means),
             color = "red", linetype = "dashed") +
  labs(title = "Distribution of Sample Means",
       x = "Sample Means of Age",
       y = "Density") +
  theme_bw()
create a function that incorporates both the sample means function and the density plot as above
plot the distribution of sample means for the variable love in the addh dataset using the function
purrr library
[[1]]
[1] 1
[[2]]
[1] 8 9 10 11 12
[[3]]
x y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
purrr library
the basic call is map(mylist, myfunction, functionoptions); the map variant can change depending on the type of output needed for your analysis
[[1]]
[1] 1
[[2]]
[1] 5
[[3]]
[1] 2
[[1]]
[1] "numeric"
[[2]]
[1] "numeric"
[[3]]
[1] "data.frame"
[[1]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1 1 1 1 1
[[2]]
Min. 1st Qu. Median Mean 3rd Qu. Max.
8 9 10 10 11 12
[[3]]
x y
Min. : 1.00 Min. :11.00
1st Qu.: 3.25 1st Qu.:13.25
Median : 5.50 Median :15.50
Mean : 5.50 Mean :15.50
3rd Qu.: 7.75 3rd Qu.:17.75
Max. :10.00 Max. :20.00
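A short sketch of map() and two typed variants, using a list rebuilt to match the output above. The name `mylist` comes from the text; the way it is constructed here is an assumption.

```r
library(purrr)

# A list with three different kinds of elements
mylist <- list(1, c(8, 9, 10, 11, 12), data.frame(x = 1:10, y = 11:20))

map(mylist, length)      # always returns a list
map_int(mylist, length)  # returns an integer vector
map_chr(mylist, class)   # returns a character vector
map(mylist, summary)     # any function can be applied to each element
```

The suffix (`_int`, `_chr`, `_dbl`, `_lgl`) fixes the output type, and the call fails if an element does not fit that type.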
# A tibble: 3 × 4
name day1 day2 day3
<chr> <dbl> <dbl> <dbl>
1 KT 8 6 5
2 Olivia 7 6 4
3 Dean 6 5 4
# A tibble: 9 × 3
name day sleep_hours
<chr> <chr> <dbl>
1 KT day1 8
2 KT day2 6
3 KT day3 5
4 Olivia day1 7
5 Olivia day2 6
6 Olivia day3 4
7 Dean day1 6
8 Dean day2 5
9 Dean day3 4
# A tibble: 9 × 3
name day sleep_hours
<chr> <chr> <dbl>
1 KT 1 8
2 KT 2 6
3 KT 3 5
4 Olivia 1 7
5 Olivia 2 6
6 Olivia 3 4
7 Dean 1 6
8 Dean 2 5
9 Dean 3 4
# A tibble: 3 × 4
name day1 day2 day3
<chr> <dbl> <dbl> <dbl>
1 KT 8 6 5
2 Olivia 7 6 4
3 Dean 6 5 4
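The wide-to-long-to-wide round trip shown above can be sketched with tidyr as follows. The object names (`sleep_wide`, `sleep_long`, `sleep_back`) are assumptions; the values repeat the table above.

```r
library(dplyr)
library(tidyr)

sleep_wide <- tibble(name = c("KT", "Olivia", "Dean"),
                     day1 = c(8, 7, 6),
                     day2 = c(6, 6, 5),
                     day3 = c(5, 4, 4))

# Wide to long: one row per person and day; names_prefix strips "day"
sleep_long <- sleep_wide %>%
  pivot_longer(cols = day1:day3,
               names_to = "day",
               names_prefix = "day",
               values_to = "sleep_hours")

# Long back to wide: one column per day, with the "day" prefix restored
sleep_back <- sleep_long %>%
  pivot_wider(names_from = day,
              values_from = sleep_hours,
              names_prefix = "day")

sleep_long
sleep_back
```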
# A tibble: 6 × 5
name activity day1 day2 day3
<chr> <chr> <dbl> <dbl> <dbl>
1 KT sleep 8 6 5
2 KT play 2 1 1
3 Olivia sleep 7 1 4
4 Olivia play 2 3 1
5 Dean sleep 5 6 4
6 Dean play 3 2 3
# A tibble: 9 × 4
name day sleep play
<chr> <chr> <dbl> <dbl>
1 KT 1 8 2
2 KT 2 6 1
3 KT 3 5 1
4 Olivia 1 7 2
5 Olivia 2 1 3
6 Olivia 3 4 1
7 Dean 1 5 3
8 Dean 2 6 2
9 Dean 3 4 3
sleep_tidy <- tibble(name = c("KT", "Olivia", "Dean", "May", "Mary"),
                     sleep = c(8, 7, 6, 5, 5),
                     play = c(2, 2, 3, 3, 2))
sleep_tidy2 <- tibble(name = c("KT", "Olivia", "Dean", "Peter", "Susan"),
                      study = c(3, 4, 5, 2, 3),
                      work = c(8, 10, 9, 6, 5))
# left_join() keeps every row of sleep_tidy; names missing from sleep_tidy2 get NA
sleep_tidy3 <- left_join(sleep_tidy, sleep_tidy2, by = "name")
sleep_tidy3
# A tibble: 5 × 5
name sleep play study work
<chr> <dbl> <dbl> <dbl> <dbl>
1 KT 8 2 3 8
2 Olivia 7 2 4 10
3 Dean 6 3 5 9
4 May 5 3 NA NA
5 Mary 5 2 NA NA
# A tibble: 3 × 5
name sleep play study work
<chr> <dbl> <dbl> <dbl> <dbl>
1 KT 8 2 3 8
2 Olivia 7 2 4 10
3 Dean 6 3 5 9
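The three-row result above keeps only names present in both tables, which is what inner_join() returns. The sketch below rebuilds the two tibbles from the listing and compares a few join types. The object names `inner`, `full`, and `only_left` are assumptions, and the full and anti joins are added for contrast rather than taken from the slides.

```r
library(dplyr)

sleep_tidy <- tibble(name = c("KT", "Olivia", "Dean", "May", "Mary"),
                     sleep = c(8, 7, 6, 5, 5),
                     play = c(2, 2, 3, 3, 2))
sleep_tidy2 <- tibble(name = c("KT", "Olivia", "Dean", "Peter", "Susan"),
                      study = c(3, 4, 5, 2, 3),
                      work = c(8, 10, 9, 6, 5))

inner     <- inner_join(sleep_tidy, sleep_tidy2, by = "name")  # names in both tables
full      <- full_join(sleep_tidy, sleep_tidy2, by = "name")   # names in either table
only_left <- anti_join(sleep_tidy, sleep_tidy2, by = "name")   # names without a match

nrow(inner)
nrow(full)
only_left$name
```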
it starts from the grammar of graphics (Wickham, 2016)
library(ggthemes)
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  labs(x = "Engine size (litres)",
       y = "Highway fuel economy (miles per gallon)",
       title = "Relationship between engine size and fuel economy",
       color = "Car type",
       caption = "Source: mpg dataset") +
  theme_economist() +
  scale_color_tableau()
# A tibble: 6 × 8
partner year partner_name product product_name US_report_import pop2000
<chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
1 ARE 1998 United Arab Emira… 950341 "Toys repre… 1.06 3.25e6
2 ARE 2000 United Arab Emira… 950349 "Toys repre… 12.0 3.25e6
3 ARE 2003 United Arab Emira… 950349 "Toys repre… 4.65 3.25e6
4 ARE 2005 United Arab Emira… 950320 "Reduced-si… 49.2 3.25e6
5 ARG 1996 Argentina 950341 "Toys repre… 0 3.69e7
6 ARG 1996 Argentina 950310 "Electric t… 10.8 3.69e7
# ℹ 1 more variable: region <dbl>
# A tibble: 5 × 2
partner_name total_import
<chr> <dbl>
1 China 26842305.
2 Denmark 1034990.
3 Canada 572309.
4 Hong Kong, China 545186.
5 Switzerland 400969.
top5_partners <- c("China", "Denmark", "Canada", "Hong Kong, China", "Switzerland")
options(scipen = 999)
library(ggthemes)
library(scales)
library(plotly)
p <- toy_imports %>%
  filter(partner_name %in% top5_partners) %>%
  group_by(year, partner_name) %>%
  summarize(total_import = sum(US_report_import)) %>%
  ggplot(aes(year, total_import, color = partner_name)) +
  geom_line() +
  labs(title = "Toy imports from the U.S.'s top-5 partners, 1996-2005",
       x = "Year",
       y = "Dollar value of imports (log scale)",
       color = "Import Region") +
  scale_x_continuous(breaks = 1996:2005) +
  theme_economist() +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x)))
ggplotly(p)

# install.packages('devtools')
# devtools::install_github('bbc/bbplot')
library(ggpubr)
library(gapminder)  # provides the gapminder data used below
source("https://raw.githubusercontent.com/kwan-MSDA/R/main/bbc_style.R")
gapminder %>%
  group_by(year, continent) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  ggplot(aes(year, median_lifeExp, color = continent)) +
  geom_line() +
  labs(title = "Life expectancy by continent and year",
       x = "Year",
       y = "Life expectancy") +
  bbc_style()

library("ggalt")
library("tidyr")
library(gapminder)
dumbbell_df <- gapminder %>%
  filter(year == 1967 | year == 2007) %>%
  select(country, year, lifeExp) %>%
  spread(year, lifeExp) %>%
  mutate(gap = `2007` - `1967`) %>%
  arrange(desc(gap)) %>%
  head(10)
# Make plot
ggplot(dumbbell_df, aes(x = `1967`, xend = `2007`, y = reorder(country, gap), group = country)) +
  geom_dumbbell(colour = "#dddddd",
                size = 3,
                colour_x = "#FAAB18",
                colour_xend = "#1380A1") +
  bbc_style() +
  labs(title = "We're living longer",
       subtitle = "Biggest life expectancy rise, 1967-2007")

library(hrbrthemes)
library(viridis)
gapminder %>%
  filter(year == 2007) %>%
  mutate(country = factor(country, levels = unique(country))) %>%
  arrange(desc(pop)) %>%
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent)) +
  geom_point(alpha = 0.6, shape = 21, color = "black") +
  scale_size(range = c(.1, 24), name = "Population (M)") +
  # guide = "none" replaces the deprecated guide = FALSE
  scale_fill_viridis(discrete = TRUE, guide = "none", option = "A") +
  theme_ipsum() +
  theme(legend.position = "none") +
  labs(title = "Life expectancy by continent in 2007",
       x = "GDP per capita",
       y = "Life Expectancy")

library(gganimate)
gapminder %>%
  # transition_time(year) below drives the animation, so no frame aesthetic is needed
  ggplot(aes(x = gdpPercap, y = lifeExp, size = pop, fill = continent)) +
  geom_point(alpha = 0.6, shape = 21, color = "black") +
  scale_size(range = c(.1, 22), name = "Population (M)") +
  scale_fill_viridis(discrete = TRUE, guide = "none", option = "A") +
  theme_ipsum() +
  theme(legend.position = "none") +
  labs(title = "Life expectancy by continent in {frame_time}",
       x = "GDP per capita",
       y = "Life Expectancy") +
  geom_text(data = gapminder %>% filter(pop > 1e+8), aes(label = country), size = 5, nudge_x = 0.1, nudge_y = 0.1) +
  transition_time(year) +
  enter_fade() +
  exit_fade()

library(plotly)
library(hrbrthemes)
library(viridis)
g <- crosstalk::SharedData$new(gapminder %>%
                                 mutate(country = factor(country, levels = unique(country))) %>%
                                 arrange(desc(pop)),
                               ~ continent)
gg <- g %>%
  # frame and ids are picked up by ggplotly() to animate and track points
  ggplot(aes(x = gdpPercap, y = lifeExp, fill = continent, frame = year)) +
  geom_point(aes(size = pop, alpha = 0.6, ids = country)) +
  scale_size(range = c(.1, 24), name = "Population (M)") +
  scale_fill_viridis(discrete = TRUE, guide = "none", option = "A") +
  scale_alpha(range = c(0.6, 1), guide = "none") +
  theme_ipsum() +
  # theme(legend.position = "none") +
  labs(title = "Life expectancy by continent between 1952-2007",
       x = "GDP per capita",
       y = "Life Expectancy")
ggplotly(gg, height = 500, width = 800)